Abstract: [?] showed that a two-round variant of the EM algorithm can 
learn a mixture of Gaussian distributions with near optimal precision with 
high probability if the Gaussian distributions are well separated and if the 
dimension is sufficiently high. In this paper, we generalize their theory to 
learning a mixture of high-dimensional Bernoulli templates. Each template 
is a binary vector, and a template generates examples by randomly 
switching its binary components independently with a certain probability. 
In computer vision applications, a binary vector is a feature map of an 
image, where each binary component indicates whether a feature or 
structure is present or absent within a certain cell of the image domain. 
A Bernoulli template can be considered a statistical model for images of 
objects (or parts of objects) from the same category. We show that the 
two-round EM algorithm can learn a mixture of Bernoulli templates with 
near optimal precision with high probability, if the Bernoulli templates 
are sufficiently different and if the number of features is sufficiently high. 
We illustrate the theoretical results by synthetic and real examples. 

1 Introduction 

During the past decades, a large number of theoretical 
results have been obtained for supervised learning such 
as classification and regression [? ]. For unsupervised 
learning, however, relatively few theoretical results are 
available. A main difficulty is that the objective func- 
tions in unsupervised learning are usually non-convex 
and multi-modal, so the optimization algorithms usually 
cannot find the global optima. As a result, it is generally 
difficult to obtain theoretical guarantees on the performances 
of the algorithms. A simple and typical example of 
unsupervised learning is clustering or learning mixture 
models, and a typical algorithm for fitting the mixture 
models is the EM algorithm [?], which is a statistical 
counterpart of the k-means algorithm. Although the EM 
algorithm is simple and interpretable, and is known to 
converge monotonically to a local mode of the observed- 
data log-likelihood, little is known about its theoretical 
performance in terms of correctly recovering the mixture 
components. As such, the EM algorithm is often called 
a heuristic algorithm. 

Fig. 1. Left: An alphabet of 18 sketch patterns. These 
sketch patterns are edge segments that connect the 
corners and mid-points of the sides of a square cell. 
Middle: The image domain is partitioned into square 
cells. Within each cell, any of the sketch patterns can 
be present or absent. The whole feature map can be 
represented by a binary vector, where each component is 
a binary decision on whether a certain sketch pattern in 
the alphabet is present or absent within a certain cell. 
Right: Some examples generated by the template in the 
middle by randomly switching the binary components with 
a certain probability. 

A major advance in the theoretical understanding of 
the EM algorithm for fitting mixture models was made 
by [?]. They proposed a two-round variant of the EM 
algorithm that consists of only two iterations of EM: the 
first iteration is initialized from a number of randomly 
selected training examples as the centers of the Gaussian 
distributions, and the second iteration is carried out after 
pruning the clusters learned from the first iteration. They 
showed that the two-round EM can learn the mixture 
of Gaussian distributions with near optimal precision 
with high probability if the Gaussian distributions are 
well separated and if the dimensionality of the Gaus- 
sian distributions is sufficiently high. Here near optimal 
precision means that one can estimate the parameters of 
the Gaussian distributions as if the memberships of the 
observations are known. 

In this paper, we generalize the theory of [?] to learning 
a mixture of Bernoulli templates. Each template is a binary 
vector, and it generates examples by independently 
switching its binary components with a certain probability. 
So the observed examples are also binary vectors. 
In potential applications in computer vision, a binary 
vector is a feature map of an image, where each binary 
component indicates whether a feature or structure is 
present or absent within a certain cell of the image 
domain. Fig. 1 illustrates the basic idea by a synthetic 
example. The image domain is equally partitioned into 
square cells (in the example in Fig. 1 there are a total 
of 9 x 9 = 81 cells in the image domain). There is an 
alphabet of sketch patterns that can appear in these 
cells (Fig. 1 shows an alphabet of 18 types of sketch 
patterns). Each cell may contain one or more sketch 
patterns, so the binary vector for each image consists of 
9 x 9 x 18 binary components, where each component 
indicates whether a certain sketch pattern is present or 
not within a certain cell. Specifically, each component is 
a binary decision that can be based on local edge 
detection, Gabor filter responses, beamlet transformation 
[?] or a pre-trained classifier. The formulation is very 
general. One can design any alphabet of local features or 
patterns, and one can use any binary detector or classifier 
to decide the presence or absence of these features within 
each cell. The whole feature map is a composition of local 
image features and is in the form of a binary vector. A 
template itself is a binary vector that is subject to 
component-wise switching or Bernoulli noise to account for 
the variations of the feature maps of individual images. 
The reason we focus on binary feature maps in this article 
is that they are easy to design and we do not need to make 
strong assumptions on their distributions such as 
Gaussianity. 

As another illustration, Fig. 2 displays some examples 
of real images and their binary sketches based on a 
simple design of image features and binary decision 
rule. We partition the image domain into square cells 
of equal size (in these images, the cells are relatively 
small, ranging from 5 x 5 pixels to 7 x 7 pixels). We 
convolve the image with Gabor filters at 8 orientations. 
Within each cell, at each orientation, we pool a local 
maximum of the Gabor filter responses (in absolute 
values). If the local maximum is above a threshold, we 
then declare that there is a sketch within this cell at 
this orientation, and the sketch is depicted by a bar in 
the corresponding binary sketch image in Fig. 2. Clearly 
the sketch image captures a lot of information in the 
corresponding original image. 

Fig. 2. Real images and their binary sketches. Each bar in 
the sketch image indicates the existence of a Gabor filter 
response above a threshold within a local cell of the image 
and at the same orientation as the bar. 

Now back to the issue of learning mixture models by 
EM. We assume that there are k Bernoulli templates, 
and each observed example is a noisy observation of 
one of the k templates. The question we want to answer 
is: given a number of training examples that are noisy 
observations of the k templates, can an EM-type 
algorithm reliably recover these k templates with high 
probability? The reason we are interested in this question 
is that it will shed light on unsupervised learning of 
templates of objects (or their parts) from real images, which 
is a crucial task for object modeling and recognition in 
computer vision. Many learning methods are based on 
fitting mixture models by EM-type algorithms, including 
the popular deformable part model [?]. In the language 
of And-Or graph [?] for object modeling, each template 
is an And-node, which is a composition of a number of 
sketches. The mixture of k templates is an Or-node, with 
each template being its child node. So the mixture of the 
templates is an Or-And structure. The theoretical results 
in this paper will be useful for us to understand the 
learning of the Or-And structure from training images. 

To answer the above question, we shall generalize the 
theory of [?] to Bernoulli distributions, and we shall show 
that the two-round EM algorithm can learn a mixture of 
Bernoulli templates with near optimal precision with 
high probability if the templates are sufficiently different 
and if the dimensions are sufficiently high. 

Generalizing the theory of [?] from Gaussian mixtures 
to the mixtures of Bernoulli distributions is far from 
being straightforward. The sample space is no longer 
Euclidean, and some results for Gaussian distributions 
cannot be translated directly into those for the Bernoulli 
models. So we have to establish a theoretical foundation 
that is suitable for our purpose. 

The rest of the paper is organized as follows. Section 
2 describes the two-round EM algorithm and states the 
main theorem. Sections 3 and 4 present theoretical results 
that lead to the proof of the main theorem. Section 5 
illustrates the theoretical results by some experiments on 
synthetic and real examples. Section 6 concludes with a 
discussion. In the text, we shall only state the theoretical 
results. The proofs can be found in the supplementary 
materials. 

2 Two-round EM with performance guarantee 

2.1 Model and algorithm 

Let $P$ be a template. It is an n-dimensional binary vector, 
i.e., $P \in \Omega = \{0, 1\}^n$. In the example in Fig. 1, 
$n = 9 \times 9 \times 18 = 1458$. Let $P(s)$ be the s-th 
component of $P$, $s = 1, \dots, n$. An example $x$ generated 
by $P$ is a noisy version of $P$, and we write $x \sim P$. 
Specifically, let $x(s)$ be the s-th component of $x$. Then 
$x(s) = P(s)$ with probability $1 - q$, and $x(s) = 1 - P(s)$ 
with probability $q$, i.e., $q$ is the probability of 
switching a component of $P$, and it defines the level of 
Bernoulli noise. We assume that $q \in (0, 1/2)$. We also 
assume that the components of $x$ are independent given $P$. 
We call $P$ a Bernoulli template because it is binary and is 
subject to Bernoulli noise. 
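
As a rough illustration of this generative model, the following Python sketch draws a noisy example from a template by flipping each component independently with probability q. The function names `sample_from_template` and `hamming`, the toy template, and the seed are illustrative assumptions and do not come from the paper.

```python
import random

def sample_from_template(template, q, rng):
    """Return a noisy copy of a binary template.

    Each component is switched (0 <-> 1) independently with probability q.
    """
    return [1 - bit if rng.random() < q else bit for bit in template]

def hamming(x, y):
    """Number of components in which two binary vectors differ."""
    return sum(a != b for a, b in zip(x, y))

rng = random.Random(0)           # fixed seed, only for reproducibility
P = [0, 1] * 50                  # a toy template with n = 100 components
x = sample_from_template(P, q=0.1, rng=rng)
print(len(x), hamming(x, P))     # the distance should be roughly n * q = 10
```

The model assumes q lies strictly between 0 and 1/2; the sketch itself does not enforce this.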

Let $\{P_i, i = 1, \dots, k\}$ be k Bernoulli templates with 
mixture weights $\{w_i, i = 1, \dots, k\}$. We assume that k is 
given. Otherwise, k can be determined by some model 
selection criteria such as BIC [? ?]. Let $x_1, \dots, x_m$ be m 
noisy observations of these k templates, where the noise 
level is $q$. The probability that $x_j$ is generated by $P_i$ is 
$w_i$, and we let $w_{\min} = \min_{i=1,\dots,k} w_i$. We define 
$\mu_i$ to be the expectation of the examples generated by $P_i$, 
i.e., $\mu_i = E[x_i]$ where $x_i \sim P_i$. Let $S_i$ be the set 
of examples coming from the template $P_i$. 

For two n-dimensional vectors $P$ and $Q$, let 
$D(P, Q) = \sum_{s=1}^{n} |P(s) - Q(s)|$ be the $\ell_1$ distance 
between $P$ and $Q$. Let $c_{ij}$ be the separation between $P_i$ 
and $P_j$, i.e., $D(P_i, P_j) = d_{ij} = n c_{ij}$. 
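
The distance and separation just defined can be computed directly. The sketch below is only an illustration: the helper names `l1_distance` and `separation` and the three toy templates are assumptions, and the overall separation is taken to be the minimum of the pairwise $c_{ij}$, as in the notation list of Section 2.2.

```python
from itertools import combinations

def l1_distance(p, r):
    """l1 distance between two vectors (Hamming distance if both are binary)."""
    return sum(abs(a - b) for a, b in zip(p, r))

def separation(templates):
    """Smallest pairwise separation c_ij = D(P_i, P_j) / n over all i != j."""
    n = len(templates[0])
    return min(l1_distance(p, r) / n for p, r in combinations(templates, 2))

# Three toy templates in dimension n = 8 (hypothetical values).
templates = [[0] * 8, [1] * 8, [1, 1, 1, 1, 0, 0, 0, 0]]
c = separation(templates)
print(c)
```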

Definition 1: The mixture is called c-separated if 



We shall show that if the separation c is sufficiently 
large, then the two-round EM algorithm will reliably 
recover $\{P_i, i = 1, \dots, k\}$. 

We use the notation $T_i$ to denote the estimated $P_i$. In 
the two-round EM, the first round initializes 
$\{T_i^{(0)}, i = 1, \dots, l\}$ to be l randomly selected 
training examples. The initial number of clusters, l, is 
greater than the true number k. Specifically, we let 
$l = \frac{12}{w_{\min}} \ln \frac{2}{\delta w_{\min}}$, 
where $\delta$ is the confidence parameter to appear later, 
i.e., with probability $1 - \delta$, the algorithm will succeed 
in recovering the mixture components. According to the 
coupon collector problem, the l examples cover all the k 
clusters with high probability. We estimate the Bernoulli 
noise level $q_0$ from 
$q_0(1 - q_0) = \min_{i \neq j} D(T_i^{(0)}, T_j^{(0)})/2n$. 
Then we run one iteration of EM. 

After the first iteration, we prune the clusters by a 
starvation scheme. The pruning process consists of two 
steps. In the first step, we remove all the templates 
$\{T_i^{(1)}\}$ whose weights are smaller than a threshold $1/(4l)$. 
In the second step, we keep only k templates that are far 
apart from each other. Specifically, we randomly choose 
a template. Then we iteratively add a template that is 
farthest away from the selected templates in terms of the 
minimum distance between the candidate template and 
the selected templates. We repeat this inclusion process 
until we get k templates. 
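
The two pruning steps can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `prune`, its argument layout, and the toy inputs are assumptions. The second step is a farthest-first selection using the minimum $\ell_1$ distance to the templates already chosen.

```python
import random

def l1(x, y):
    """l1 distance between two (possibly real-valued) vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def prune(templates, weights, k, threshold, rng):
    """Starvation step followed by farthest-first selection of k templates."""
    # Step 1: remove templates whose weight is below the threshold.
    kept = [t for t, w in zip(templates, weights) if w >= threshold]
    # Step 2: start from a random survivor, then repeatedly add the survivor
    # whose minimum distance to the already selected templates is largest.
    selected = [kept.pop(rng.randrange(len(kept)))]
    while len(selected) < k and kept:
        far = max(range(len(kept)),
                  key=lambda i: min(l1(kept[i], s) for s in selected))
        selected.append(kept.pop(far))
    return selected

# Toy input: two groups of similar templates plus one starved template.
cands = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 0, 1]]
wts = [0.3, 0.3, 0.2, 0.15, 0.05]
chosen = prune(cands, wts, k=2, threshold=0.1, rng=random.Random(0))
print(chosen)
```

Whatever the random starting template is, the selection in this toy case ends with one template from each of the two groups, and the starved template never appears.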

After the pruning process, we run another iteration 
of EM. The estimated templates from this second round 
EM are already near optimal as we will show. 

To be more precise, Algorithm 1 describes the two-round 
EM. In Step 9 the templates $\{T_i^{(2)}\}$ are to be 
converted to binary by rounding to the nearest integer. 
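
For concreteness, the whole procedure can be sketched in plain Python. This is a hedged sketch under stated assumptions, not the authors' code: the names `em_step` and `two_round_em` are hypothetical, responsibilities are computed in log space for numerical stability, the noise estimate is clamped away from 0 and 1/4, and the number of initial clusters `l` is passed in by the caller rather than computed from $w_{\min}$ and $\delta$.

```python
import math
import random

def l1(x, y):
    """l1 distance; equals the Hamming distance for binary vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def em_step(X, T, w, q):
    """One EM iteration with a shared noise level q; returns (templates, weights)."""
    m, n = len(X), len(X[0])
    log_odds = math.log(q) - math.log(1 - q)      # log(q / (1 - q)) < 0
    resp = []
    for x in X:
        # log(w_i f_i(x)) up to the common additive constant n * log(1 - q)
        logs = [math.log(wi) + l1(x, t) * log_odds for wi, t in zip(w, T)]
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    new_w = [sum(r[i] for r in resp) / m for i in range(len(T))]
    new_T = [[sum(r[i] * x[s] for r, x in zip(resp, X)) / max(m * new_w[i], 1e-12)
              for s in range(n)] for i in range(len(T))]
    return new_T, new_w

def two_round_em(X, k, l, rng):
    """Two-round EM sketch: init from l examples, one EM step, prune, one EM step."""
    n = len(X[0])
    T0 = [list(x) for x in rng.sample(X, l)]
    d_min = min(l1(T0[i], T0[j]) for i in range(l) for j in range(i + 1, l))
    # Solve q0 (1 - q0) = d_min / (2n) for the root below 1/2 (clamped: assumption).
    v = min(max(d_min / (2 * n), 1e-6), 0.25 - 1e-6)
    q0 = (1 - math.sqrt(1 - 4 * v)) / 2
    T1, w1 = em_step(X, T0, [1 / l] * l, q0)
    # Pruning, step 1: drop starved templates (weight below 1 / (4 l)).
    kept = [t for t, wi in zip(T1, w1) if wi >= 1 / (4 * l)]
    # Pruning, step 2: farthest-first selection of k templates.
    chosen = [kept.pop(rng.randrange(len(kept)))]
    while len(chosen) < k and kept:
        far = max(range(len(kept)),
                  key=lambda i: min(l1(kept[i], c) for c in chosen))
        chosen.append(kept.pop(far))
    T2, w2 = em_step(X, chosen, [1 / len(chosen)] * len(chosen), q0)
    # Round the final templates to binary vectors.
    return [[1 if v > 0.5 else 0 for v in t] for t in T2], w2

rng = random.Random(0)
n, q = 60, 0.1
P1, P2 = [0] * n, [1] * n
X = [[1 - b if rng.random() < q else b for b in P]
     for P in (P1, P2) for _ in range(100)]
templates, weights = two_round_em(X, k=2, l=16, rng=rng)
print(sorted(sum(t) for t in templates), [round(w, 2) for w in weights])
```

The toy data use two maximally separated templates, so this setting is far easier than the general case covered by the theory.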

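Algorithm 1 Two-round EM for Learning Bernoulli Templates 

Input: Examples $x_1, \dots, x_m \in \Omega$, $m > N(\delta)$ 
Output: Templates $T_i$, $i = 1, \dots, k$ 
[1] Initialize $T_i^{(0)}$, $i = 1, \dots, l$, as l random training examples 
[2] Initialize $w_i^{(0)} = 1/l$ and $q_0 < 1/2$ such that 

    $q_0(1 - q_0) = \frac{1}{2n} \min_{i \neq j} D(T_i^{(0)}, T_j^{(0)})$. 

[3] E-Step: Compute for each $i = 1, \dots, l$ 

    $f_i(x_j) = q_0^{D(x_j, T_i^{(0)})} (1 - q_0)^{n - D(x_j, T_i^{(0)})}$, $j = 1, \dots, m$, 

    $p_i^{(1)}(x_j) = \frac{w_i^{(0)} f_i(x_j)}{\sum_{i'} w_{i'}^{(0)} f_{i'}(x_j)}$, $j = 1, \dots, m$ 

[4] M-Step: Update 

    $w_i^{(1)} = \sum_{j=1}^{m} p_i^{(1)}(x_j)/m$, 

    $T_i^{(1)} = \frac{1}{m w_i^{(1)}} \sum_{j=1}^{m} p_i^{(1)}(x_j) x_j$ 

[5] Pruning: Remove all $T_i^{(1)}$ with $w_i^{(1)} < w_T = \frac{1}{4l}$ 
[6] Pruning: Keep only k templates $T_i^{(1)}$ far apart. 
[7] Initialize $w_i^{(1)} = 1/k$ and $q_1 = q_0$. 
[8] E-Step: Compute 

    $f_i(x_j) = q_1^{D(x_j, T_i^{(1)})} (1 - q_1)^{n - D(x_j, T_i^{(1)})}$, $j = 1, \dots, m$ 

    $p_i^{(2)}(x_j) = \frac{w_i^{(1)} f_i(x_j)}{\sum_{i'} w_{i'}^{(1)} f_{i'}(x_j)}$, $j = 1, \dots, m$ 

[9] M-Step: Update 

    $w_i^{(2)} = \sum_{j=1}^{m} p_i^{(2)}(x_j)/m$, 

    $T_i^{(2)} = \frac{1}{m w_i^{(2)}} \sum_{j=1}^{m} p_i^{(2)}(x_j) x_j$ 

2.2 Notation 

For the convenience of reference, the following summarizes 
the notation used in this paper: 
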

• n is the dimension of Bernoulli templates, which 
generate examples in $\Omega = \{0, 1\}^n$. 

• $q \in (0, 1/2)$ is the level of noise 

. B=l(l- 2g) lnl>0, 



E = min 



1 § c(l - 2g) - 2g 
2' c(l-2g) + 2<? 



• $w_{\min}$: the minimum of the mixture weights. 

• $P_i$ is the i-th Bernoulli template 

• $S_i$ is the set of examples coming from the template $P_i$. 

• $c_{ij}$ is the separation between the Bernoulli templates, 
$D(P_i, P_j) = d_{ij} = n c_{ij}$ 

• $c = \min_{i \neq j} c_{ij}$ 

• l is the initial number of mixture components, 
$l = \frac{12}{w_{\min}} \ln \frac{2}{\delta w_{\min}}$. 
$\delta$ is the confidence parameter in Theorem 1. 

• $w_T = \frac{1}{4l}$ is the threshold for pruning the clusters 
learned by the first round. 

• $C_i$ collects the templates that are initialized from 
examples in the i-th cluster $S_i$ and survive the 
pruning process after the first round of EM, i.e. 

    $C_i = \{T_j^{(1)} : T_j^{(0)} \in S_i,\ w_j^{(1)} \geq w_T\}$ 

2.3 Conditions 

We assume the following conditions hold: 

CO: c > WQ 



CI: c> 



3(1 - 2g) 



or equivalently E > 



C2 
C3 
C4 



n(l - 2q) > max(24, — In — ) 

cB c 
nc 2 (l-2q) 2 > 3456(7 In Sel 

mc 2 (l - 2q) 2 > 27648<?/( - + — 
V 2 n 



The above conditions require that the Bernoulli tem- 
plates are sufficiently different from each other, and that 
the dimension n and the number m of training examples 
are sufficiently large. 

2.4 Main result 

Theorem 1: Suppose m examples are generated from a mixture 
of k Bernoulli templates under Bernoulli noise of 
level $q$ and $w_i \geq w_{\min}$ for all i. Let 
$\epsilon, \delta \in (0, 1)$. If conditions C0-C4 hold and in 
addition the following conditions hold 

1) The initial number of clusters is 
$l = \frac{12}{w_{\min}} \ln \frac{2}{\delta w_{\min}}$. 

2) The number of examples is 
$m > \frac{8}{w_{\min}} \ln \frac{12k}{\delta}$. 

3) The separation is 
$c > \frac{8}{nB} \ln \frac{5n}{\epsilon w_{\min}}$. 

4) The dimension is 
$n > \max\left(\frac{3}{qE^2} \ln \frac{18m^2}{\delta},\ 2 \ln \frac{12k}{\delta}\right)$. 


Then with probability at least $1 - \delta$, the estimated 
templates after round 2 of EM satisfy: 

    $D(T_i^{(2)}, P_i) < D(\mathrm{mean}(S_i), P_i) + \epsilon q$ 

The above theorem states that with high probability, 
the estimated templates from the two-round EM are 
nearly as accurate as if we knew the memberships of the 
examples. 

3 Basic facts 

We shall first establish some basic facts about the 
Bernoulli templates perturbed by Bernoulli noise. They 
are concerned with the $\ell_1$ distances among templates and 
their examples. 

Proposition 1: Let $P, Q \in \Omega$ be Bernoulli templates 
with noise level $q$. We have: 

1) If $x \sim P$ then 

    $E[D(x, P)] = nq$, $\mathrm{Var}[D(x, P)] = nq(1 - q)$ 

2) If $x \sim P$ and $y \in \Omega$ then 

    $E[D(x, y)] = nq + D(P, y)(1 - 2q)$ 
    $\mathrm{Var}[D(x, y)] = nq(1 - q)$ 

3) If $x, y \sim P$ then 

    $E[D(x, y)] = 2nq(1 - q)$ 
    $\mathrm{Var}[D(x, y)] = 2nq(1 - q)(1 - 2q + 2q^2)$ 

4) If $x \sim P$, $y \sim Q \neq P$ then 

    $E[D(x, y)] = 2nq(1 - q) + D(P, Q)(1 - 2q)^2$ 
    $\mathrm{Var}[D(x, y)] = 2nq(1 - q)(1 - 2q + 2q^2)$ 
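
The moments in item 4 can be checked numerically. The following Monte Carlo sketch is only a sanity check under assumed toy values (n = 200, q = 0.1, D(P, Q) = 50, 4000 trials); the helper names `flip` and `hamming` are illustrative.

```python
import random

def flip(template, q, rng):
    """Switch each binary component independently with probability q."""
    return [1 - b if rng.random() < q else b for b in template]

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

rng = random.Random(0)
n, q, d = 200, 0.1, 50
P = [0] * n
Q = [1] * d + [0] * (n - d)          # D(P, Q) = d
trials = 4000
dists = [hamming(flip(P, q, rng), flip(Q, q, rng)) for _ in range(trials)]

mean = sum(dists) / trials
var = sum((v - mean) ** 2 for v in dists) / (trials - 1)

# Values predicted by Proposition 1, item 4.
expected_mean = 2 * n * q * (1 - q) + d * (1 - 2 * q) ** 2
expected_var = 2 * n * q * (1 - q) * (1 - 2 * q + 2 * q * q)
print(mean, expected_mean, var, expected_var)
```

With these values the predicted mean is about 68 and the predicted variance about 29.5; the empirical estimates should fall close to both.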

Proposition 2: Let $P, Q \in \Omega$ be Bernoulli templates 
with noise level $q$. We have: 

a) If $x \sim P$ and $\lambda > 1$ then 

    $P(D(x, P) > \lambda nq) < e^{-nq(\lambda - 1)^2/3}$ 

b) If $x \sim P$ and $\epsilon \in (0, 1)$ then 

    $P(|D(x, P) - nq| > \epsilon nq) < 2e^{-nq\epsilon^2/3}$ 

c) If $x \sim P$, $y \sim Q$ and 

    $\nu(P, Q) = 2nq(1 - q) + D(P, Q)(1 - 2q)^2$ 

then for any $\epsilon \in (0, 1)$ 

    $P(|D(x, y) - \nu(P, Q)| > \epsilon \nu(P, Q)) < 2e^{-\nu(P, Q)\epsilon^2/3}$ 

Prop. 2 states that the $\ell_1$ distance between an example 
and its template is concentrated around $nq$, while the 
distance between two examples from two different 
templates is concentrated around $\nu(P, Q)$. This leads to the 
following proposition. 

Proposition 3: Draw m samples from a c-separated 
mixture of k Bernoulli templates with mixing weights 
at least $w_{\min}$. Let $\epsilon_0 > 0$. Then with probability at least 
$1 - 2m^2 e^{-2nq(1-q)\epsilon_0^2/3} - m e^{-nq\epsilon_0^2/3} - k e^{-m w_{\min}/8}$: 

a) For any $x, y \in S_i$ we have 

    $D(x, y) = 2nq(1 - q)(1 \pm \epsilon_0)$ 

b) For any $x \in S_i$, $y \in S_j$, $i \neq j$, we have 

    $D(x, y) = n(2q(1 - q) + c_{ij}(1 - 2q)^2)(1 \pm \epsilon_0)$ 

c) For any $x \in S_i$ we have 

    $D(x, P_i) = nq(1 \pm \epsilon_0)$ 

    $D(x, P_j) = n(q + c_{ij}(1 - 2q))(1 \pm \epsilon_0)$ 

d) Each $|S_i| \geq \frac{1}{2} m w_i$. 

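Lemma 1: Let $Z_i = \frac{1}{m} \sum_{j=1}^{m} B_{ij}$ where $B_{ij}$ are 
Bernoulli r.v. with $E[B_{ij}] = q$. Then 

    $P\left(\left|\sum_{i=1}^{n} Z_i - nq\right| > \lambda\right) < 2\exp\left(-\frac{m\lambda^2}{3nq}\right)$ 
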

Proposition 4: (Average of subsets) Draw a set $S_i$ of 
m examples randomly from template $P \in \{0, 1\}^n$ with 
noise level $q < 1/2$. Then with probability at least $1 - \delta$, 
for any $t > n$ there is no subset of 
$S_i$ of size at least t whose average $\mu$ has 

    $D(\mu, P) > nq + \sqrt{3nq\left(\ln\frac{me}{t} + \frac{1}{t}\ln 2 + \frac{1}{t}\ln\frac{1}{\delta}\right)}$ 

Prop. 4 states that the sample average is unlikely to 
deviate too far from $P$. 

Proposition 5: (Weighted averages) For any finite set of 
points $S \subset \{0, 1\}^n$ and weights $w_x \in [0, 1]$, $x \in S$, there 
exists a subset $T \subset S$ such that 

1) $|T| = \lfloor \sum_{x \in S} w_x \rfloor$ 

2) $D(\mu_T, P) > D(\mu_w, P)$ where 

    $\mu_T = \frac{1}{|T|} \sum_{x \in T} x$ and $\mu_w = \frac{\sum_{x \in S} w_x x}{\sum_{x \in S} w_x}$ 

Prop. 5 states that the weighted average can be 
bounded by an unweighted average. 

4 Milestones of the Proof 

In this section we state the results that hold for the 
estimated template parameters after each EM iteration. 
We assume that conditions C0-C4 hold and that e < E. 

4.1 Initialization 

This section analyzes the initial estimates for the param- 
eters before the first round of EM. 

Proposition 6: With probability at least 
$1 - k(l + 1)e^{-l w_{\min}} - k e^{-l w_{\min}/12}$: 

1) Every true template is represented by at least two 
initial estimates $T_i^{(0)}$. 

2) The number of $T_j^{(0)}$ coming from $P_i$ is at most $\frac{3}{2} l w_i$. 

3) The noise estimate satisfies 

    $q_0(1 - q_0) = q(1 - q)(1 \pm \epsilon_0)$. 

By initializing from more templates than the actual 
number of clusters, there is high probability that the 
templates cover all the clusters. 

4.2 First Round of EM 

Proposition 7: Suppose $T_i^{(0)} \in S_i$ and $T_j^{(0)} \in S_j$, $i \neq j$. 
Then for any $x \in S_i$ the ratio between the probabilities 
$p_i$ and $p_j$ is 

    $\frac{p_i^{(1)}(x)}{p_j^{(1)}(x)} > \exp(n c_{ij} B (1 - 2q))$ 

Prop. 7 states that the first round of EM is much more 
likely to assign the examples to the template of the same 
cluster than a different cluster. 

Proposition 8: Any non-starved estimate G Ci 

satisfies with probability 1 — e~ n l 2 

D(T^\l> i )<nq+^nc(l-2q) 

So the estimated template of a cluster is very likely to 
be close to the true template of this cluster. 



4.3 Pruning 

Proposition 9: The sets $C_i$ obey the following properties: 

a) Each $C_i$ is non-empty 

b) There exists $r \in \mathbb{R}$ such that for any $x \in C_i$ and 
$y, z \in C_j$, $j \neq i$ we have $D(y, z) < r$ and $D(x, y) > r$. 

c) The pruning procedure finds exactly one member 
of each $C_i$. 


4.4 Second Round of EM 

We permute the obtained templates so that G 

Si. 

Proposition 10: Suppose t[ 1] g Si and T^ 1} G S jf i ^ j. 
Then for any x G Si the ratio between the probabilities 
Pi and pj is 



pP(x) 
Pf } (x) 



> exp(-nQj(l — 2g) In — ) = exp(nQj£?/4) 



Theorem 2: Suppose that / > k and Wi > w m i n for all i 
and that conditions CO — C4 hold. Then with probability 
at least 1 - 2m 2 e- 2nq ^-^ e ^ 3 - me- nqe ^ 3 - ke~ mw ^^ - 
k(l + l)e- Zw -^ - ke~ lw ^l 12 - ke~ n l 2 , the estimated 
templates after the round 2 of EM satisfy: 

5 



D(jf\Vi) <D(mean(5<),P<)- 



-ncB/8 



nq 



We are now ready to prove Theorem 1. 
Proof of Theorem 1. 

From / = In , we get ke~ lWmin ^ 12 < 



kSw m i n /2 < 5/2. Also 



lw r) 



k(l + i)e~ lWmin = 24k ^-e~ lw — < 

LZW rn i n 



^ 2^-lWmin+lWmin/ 12 ^ _ e ~ I W m i n / 12 ^ fi^Q 



Take e = E > (because of CI). From the dimension 

3 \Styi 2 2 
condition > — ^ In — - — we get 3m 2 e _nge o/ 3 < 5/6 so 

2m 2 e -2ng(l-g)e2/3 + me -n^/3 < 3^-7^/8 < £/ 6> 

12A: 

From the dimension condition n > 2 In — ^— we get 

ke~ n ' 2 < 5/12. 
From the number of examples condition, we get 

ke -mw min /8 < ^ 12> 

From Theorem |2j putting all of the above inequalities 
together and taking nc> — In , we obtain Theo- 
rem □ □ 



5 Experiments 

This section illustrates the theoretical results obtained 
in the previous sections by a simulation study as well 
as experiments on synthetic image sketches and real 
images. 

5.1 Simulation study 

In this section we conduct experiments showing that indeed, 
the true templates are found with high probability 
when the conditions of Theorem 1 hold. 

We will work with a mixture of two templates, $P_1 = \mathbf{0}$ 
and $P_2 = \mathbf{1}$, containing all zeros and all ones respectively. 
The separation between these templates is maximal, $c = 1$. 
The mixture weights are equal, $w_1 = w_2 = w_{\min} = 0.5$. 
We will study two levels of noise $q \in \{0.1, 0.2\}$ and three 
dimensions $n \in \{10, 100, 2000\}$. 

Condition C1 and the separation condition 3 from 
Theorem 1 are satisfied for both levels of noise. Condition 
C2 is only satisfied for n=100 and 1000, while condition 
C3 is only satisfied for n=1000. 

Fig. 3. Success rates vs. number of training examples 
for learning from a mixture of two templates with the 
two-round EM and the standard EM algorithms. Left: Noise 
level q = 0.1. Right: Noise level q = 0.2. 

Fig. 3 plots the percentage of times the two templates 
are found exactly vs. the number of training examples by the 
two-round EM and the standard EM algorithms. For the 
standard EM we assumed the noise level is a known 
parameter. All results are obtained from 100 runs. 

Also shown in Fig. 3 is the bound on probability 
$1 - \delta > 1 - 12k e^{-m w_{\min}/8}$ obtained from condition 2 
of Theorem 1. However, the dimension condition 4 of 
Theorem 1 is n > 1837 for q = 0.1 (assuming $\delta = 0.1$ and 
m = 100) and n > 64,000 for q = 0.2, so it is violated by 
a large margin for q = 0.2. 
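
To make the sample-size bound concrete, the short sketch below evaluates it for the two-template setting of this section (k = 2, $w_{\min}$ = 0.5). The function names `success_lower_bound` and `min_examples` are illustrative assumptions; the formulas are the bound $1 - 12k e^{-m w_{\min}/8}$ quoted above and its inversion for m.

```python
import math

def success_lower_bound(m, k, w_min):
    """Lower bound 1 - 12 k exp(-m w_min / 8) implied by condition 2."""
    return 1 - 12 * k * math.exp(-m * w_min / 8)

def min_examples(k, w_min, delta):
    """Smallest integer m with m >= (8 / w_min) ln(12 k / delta)."""
    return math.ceil(8 / w_min * math.log(12 * k / delta))

k, w_min, delta = 2, 0.5, 0.1
m_needed = min_examples(k, w_min, delta)
print(m_needed, success_lower_bound(m_needed, k, w_min))
```

This covers only the sample-size condition; the remaining conditions of Theorem 1 still have to hold for the guarantee to apply.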

From the experiments we observe that the templates 
are found with high probability when the dimension n 
and the sample size m are large enough, even when 
some of the conditions of Theorem 1 are violated. 

We also observe that when the dimension is large, 
the standard EM and the two-round EM produce iden- 
tical results. However, when the dimension is small, 
the two-round EM performs better than the standard 
EM, because it always finds the two templates given 
enough training examples, while the standard EM can 
fail sometimes. 

5.2 Synthetic image sketches 

In this experiment we work with a mixture of two 
Bernoulli templates, shown in the bottom row of Fig. 4, 
in a space of dimension n = 9 x 9 x 18 = 1458. By 
perturbing the entries with Bernoulli noise with level q 
we obtain images such as those shown in the top row of 
Fig. 4. Fig. 5 shows the success rate of finding the two 
templates exactly using the two-round EM algorithm vs. 
the number of training examples. The experiments are 
run for two levels of noise $q \in \{0.1, 0.2\}$ and two mixture 
weights $w_{\min} \in \{0.2, 0.4\}$. 

Fig. 4. Top row: Examples of training images. Bottom 
row: the Bernoulli templates used to generate the training 
images. 

Fig. 5. Success rates vs. number of training examples 
for learning from a mixture of two templates with the 
two-round EM algorithm for two levels of noise $q \in \{0.1, 0.2\}$ 
and two mixture weights $w_{\min} = 0.4$ (left) and $w_{\min} = 0.2$ 
(right). 

Also shown is the bound $1 - \delta > 1 - 12k e^{-m w_{\min}/8}$ 
from condition 2 of Theorem 1. 

The separation between the two templates is quite 
small c = 0.02, because the two templates share a lot of 
zero components. So the separation conditions fail in this 
case. Since the conditions of Theorem 1 are not satisfied, 
the bound on the number of training examples is not expected 
to hold. We may achieve a better bound if we reduce 
the dimension n while increasing c by selecting those 
features that differentiate the templates. In any case, we 
see that in the given scenarios the two templates can be 
recovered with 100% certainty with the two-round EM 
given sufficiently many examples. So Theorem 1 might 
hold under milder assumptions than ours. 

5.3 Experiments on real images 

We also did experiments on real images. Each image is 
first convolved with Gabor filters tuned to 8 orientations. 
Then the image domain is partitioned into equal sized 
squared cells (the size ranges from 5x5 pixels to 7 x 7 
pixels). Within each cell, at each orientation, we pool the 
maximum of the absolute values of the filter responses. 
If the maximum is above a threshold, we declare that 
there is a sketch within this cell at this orientation. Thus 
each cell produces a binary vector of 8 components. We 
then concatenate the binary vectors of all the cells into 
a big binary vector. So each image is transformed into a 
binary vector. 
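
The pooling and thresholding steps can be sketched as follows. This is an illustration only: it assumes the Gabor filter responses have already been computed and are passed in as one 2-D array per orientation, and the function name `binary_sketch`, the cell size, and the threshold value are hypothetical.

```python
def binary_sketch(responses, cell, threshold):
    """Turn filter responses into a binary feature vector.

    responses: list over orientations of H x W lists of filter responses.
    For every cell and every orientation, the maximum absolute response in
    the cell is compared against the threshold.
    """
    height, width = len(responses[0]), len(responses[0][0])
    bits = []
    for top in range(0, height - cell + 1, cell):
        for left in range(0, width - cell + 1, cell):
            for resp in responses:
                peak = max(abs(resp[r][c])
                           for r in range(top, top + cell)
                           for c in range(left, left + cell))
                bits.append(1 if peak > threshold else 0)
    return bits

# Toy input: 2 orientations on a 4 x 4 image, pooled over 2 x 2 cells.
o1 = [[0.0] * 4 for _ in range(4)]
o2 = [[0.0] * 4 for _ in range(4)]
o1[0][1] = -5.0      # strong response in the top-left cell
o2[3][3] = 2.0       # strong response in the bottom-right cell
bits = binary_sketch([o1, o2], cell=2, threshold=1.0)
print(bits)
```

In the experiments described above there are 8 orientations, so each cell would contribute 8 components to the concatenated vector.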

We then use the two-round EM algorithm to cluster 
the images and learn a binary template for each cluster. 

Fig. 6. Clustering wolves, deer, cats and rabbits. In each row, the first plot displays the learned template and the rest 
of the plots show some of the examples in the corresponding cluster. There are 15 images in each cluster. 

Fig. 6 to 8 show the results of three experiments (animal 
faces, animal bodies, and vehicles). In the learned 
templates, the existence of a sketch at each cell is represented 
by a bar at the center of this cell and at the orientation 
of the sketch. In each experiment, there are 15 images in 
each cluster, and the two-round EM is able to separate 
the clusters perfectly. For the real images, the templates 
are denser than those in Fig. 4 because the numbers of 
cells are larger. 

Fig. 7. Clustering eagles, seagulls and horses. 

Currently we use a very simple sketch detector by 
thresholding the Gabor responses at different orientations. 
We will design more sophisticated features and 
the associated detectors in future work. 

6 Discussion 

This paper obtains theoretical guarantees on the performance 
of a two-round EM algorithm for learning a mixture 
of Bernoulli templates, by generalizing the theory of 
[?]. Unlike the theoretical results for supervised learning, 
results on unsupervised learning such as clustering are 
relatively scarce. The results obtained in this paper can 
be useful for understanding the behavior of EM-type 
algorithms for unsupervised learning. 

In our future work, we shall improve the theoretical 
results by relaxing the conditions on the separation 
between the templates as well as the sample size. We 
shall also generalize Bernoulli templates to more general 
statistical models for images, such as templates with 
dependent switching of the binary components, as well as 
other non-Gaussian models such as exponential family 
models. 

Appendix: Proofs 



Fig. 8. Clustering motorcycles, bicycles and cars. 

Proof of Prop. 1. 
1. Let $B_k$ indicate that the k-th component of $x$ differs 
from that of $P$, so that $D(x, P) = \sum_{k=1}^{n} B_k$ with 
$E[B_k] = q$. We have 

    $E[D(x, P)] = E\left[\sum_{k=1}^{n} B_k\right] = \sum_{k=1}^{n} E[B_k] = nq$ 

and 

    $E[D(x, P)^2] = E\left[\left(\sum_{k=1}^{n} B_k\right)^2\right] = E\left[\sum_{i=1}^{n} B_i^2 + \sum_{i \neq j} B_i B_j\right] = nq + n(n - 1)q^2$ 

    $\mathrm{Var}(D(x, P)) = E[D(x, P)^2] - E[D(x, P)]^2$ 
    $= n(n - 1)q^2 + nq - n^2q^2 = nq(1 - q)$ 

2. Let $d = D(P, y)$. Wlog $P = (A, B)$, $y = (A, 1 - B)$ 
where $B \in \{0, 1\}^d$ and $x = (u, z)$, $u \sim A$, $z \sim B$. Observe 
that if two r.v. are independent then $\mathrm{Var}(A + B) = 
\mathrm{Var}(A) + \mathrm{Var}(B)$. Then 

    $E[D(x, y)] = E[D(u, A) + D(z, 1 - B)] = (n - d)q + (d - E[D(z, B)])$ 
    $= (n - d)q + d - dq = nq + d(1 - 2q)$ 

    $\mathrm{Var}(D(x, y)) = \mathrm{Var}[D(u, A) + d - D(z, B)]$ 
    $= \mathrm{Var}[D(u, A)] + \mathrm{Var}[d - D(z, B)]$ 
    $= (n - d)q(1 - q) + dq(1 - q) = nq(1 - q)$ 

3. In the case when $x, y \sim P$ we have 

    $E_{x,y}[D(x, y)] = E_x[E_y[D(x, y)]]$ 
    $= E_x[nq + D(x, P)(1 - 2q)]$ 
    $= nq + nq(1 - 2q) = 2nq(1 - q)$ 

    $\mathrm{Var}_{x,y}(D(x, y)) = E_{x,y}[D(x, y)^2] - (E_{x,y}[D(x, y)])^2$ 
    $= E_x(E_y[D(x, y)^2]) - E_x((E_y[D(x, y)])^2)$ 
    $+ E_x((E_y[D(x, y)])^2) - (E_x[E_y(D(x, y))])^2$ 
    $= E_x(\mathrm{Var}_y[D(x, y)]) + \mathrm{Var}_x[E_y(D(x, y))]$ 
    $= E_x(nq(1 - q)) + \mathrm{Var}_x[nq + D(x, P)(1 - 2q)]$ 
    $= nq(1 - q) + nq(1 - q)(1 - 2q)^2 = 2nq(1 - q)(1 - 2q + 2q^2)$ 

4. In the case when $x \sim P$, $y \sim Q$ we have 

    $E_{x,y}[D(x, y)] = E_x[E_y[D(x, y)]]$ 
    $= E_x[nq + D(x, Q)(1 - 2q)]$ 
    $= nq + (nq + D(P, Q)(1 - 2q))(1 - 2q)$ 
    $= 2nq(1 - q) + D(P, Q)(1 - 2q)^2$ 

    $\mathrm{Var}_{x,y}(D(x, y)) = E_x(\mathrm{Var}_y[D(x, y)]) + \mathrm{Var}_x[E_y(D(x, y))]$ 
    $= E_x(nq(1 - q)) + \mathrm{Var}_x[nq + D(x, Q)(1 - 2q)]$ 
    $= nq(1 - q) + nq(1 - q)(1 - 2q)^2$. □ 

Proof of Prop. 2. Statements a), b) follow directly from 
the Chernoff inequality. 

c) Let $C$ be the indices of the $n - d$ common elements 
of $P$ and $Q$. Let $B_i$ be the Bernoulli event that the i-th 
elements of $x$ and $y$ are different. Then $E(B_i) = 2q(1 - q)$ 
if $i \in C$ and $E(B_i) = q^2 + (1 - q)^2$ if $i \notin C$. Observe 
that $D(x, y) = \sum_{i=1}^{n} B_i$. Thus by the Chernoff inequality, 
since $\nu = E[D(x, y)] = 2nq(1 - q) + d(1 - 2q)^2$ we get 

    $P(|D(x, y) - \nu| > \epsilon\nu) < 2e^{-\nu\epsilon^2/3}$. □ 

Proof of Prop. 3. a) From point c) of Prop. 2 with $P = Q$, 
we have $\nu = \nu(P, P) = 2nq(1 - q)$ so for any two points 
$x, y \in S_i$ we have $P(|D(x, y) - \nu| > \epsilon_0\nu) < 2e^{-\nu\epsilon_0^2/3}$. 
Thus for all $m(m - 1)/2$ combinations of two points we 
have 

    $P(|D(x, y) - \nu| > \epsilon_0\nu) < m(m - 1)e^{-\nu\epsilon_0^2/3} < m^2 e^{-2nq(1-q)\epsilon_0^2/3}$ 

b) Similar proof with a), with $\nu = \nu(P, Q) = 2nq(1 - q) + 
D(P, Q)(1 - 2q)^2 = 2nq(1 - q) + n c_{ij}(1 - 2q)^2 > 2nq(1 - q)$. 

c) From point b) of Prop. 2 we have $P(|D(x, P_i) - nq| > 
\epsilon_0 nq) < 2e^{-nq\epsilon_0^2/3}$ so for all m points we have 

    $P(|D(x, P_i) - nq| > \epsilon_0 nq) < m e^{-nq\epsilon_0^2/3}$ 

d) Let $B_j$ be the Bernoulli event that sample j is drawn 
from template $P_i$. Then $E[B_j] = w_i$ and from the 
Chernoff bound 

    $P\left(|S_i| < \frac{1}{2} m w_i\right) = P\left(\frac{1}{m}\sum_{j=1}^{m} B_j - w_i < -\frac{w_i}{2}\right) < e^{-m w_i (1/2)^2/2} \leq e^{-m w_{\min}/8}$. □ 


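Proof of Lemma 1. The mean of mn Bernoullis $B_{ij}$ with 
$E[B_{ij}] = q$ (the coordinates of the $Z_i$) satisfies 

    $P\left(\left|\frac{\sum_{i,j} B_{ij}}{mn} - q\right| > \epsilon q\right) < 2e^{-mnq\epsilon^2/3}$ 

So 

    $P\left(\left|\sum_{i} Z_i - nq\right| > \epsilon nq\right) < 2e^{-mnq\epsilon^2/3}$ 

and we take $\epsilon = \lambda/(nq)$. □ 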

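Proof of Prop. 4. First, it is sufficient to prove it for 
subsets of size exactly t, otherwise we increase t. Without 
loss of generality, we can assume $P = 0$. From Lemma 
1 we have 

    $P(|D(\mu, P) - nq| > \lambda) < 2e^{-t\lambda^2/(3nq)}$ 

The number of t-point subsets of $S_i$ is $\binom{m}{t} < (me/t)^t$, 
thus 

    $P(\exists$ subset of t points s.t. $D(\mu, P) - nq > \lambda)$ 
    $< 2\left(\frac{me}{t}\right)^t e^{-t\lambda^2/(3nq)}$ 

Solving for $2\left(\frac{me}{t}\right)^t e^{-t\lambda^2/(3nq)} = \delta$ we get 

    $\lambda = \sqrt{3nq\left(\ln\frac{me}{t} + \frac{1}{t}\ln 2 + \frac{1}{t}\ln\frac{1}{\delta}\right)}$ 

therefore 

    $P\left(\exists$ subset of t points s.t. $D(\mu, P) - nq > \sqrt{3nq\left(\ln\frac{me}{t} + \frac{1}{t}\ln 2 + \frac{1}{t}\ln\frac{1}{\delta}\right)}\right) < \delta$. □ 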

Proo/ 0/ Prop |5| Sort the points x g S by D(x, P) 
l x * — an< ^ ta ^ e ^ as tne ones with |T| 
Exes w *\ l ar g est values. Then 

n n 



xGT 2=1 



XG5 



SO 



2=1 



T 



>e Exg ^ x|x : Pz| =^,P).a 



E 



xG5 



P(|£)(x, Pi) — ng| > eonq) < me' 



-nqel/3 



Proof of Prop. 6. 1. Let $B_i$ be the Bernoulli event that a 
random sample from the mixture comes from the i-th 
true template $P_i$. Then $E[B_i] = w_i$. Having l random 
samples $B_{ij}$ from the Bernoulli event $B_i$, then 



From Prop. |5] there exists T C Si with \T\ = [mw T /2^ 
1J such that D{n T ,Q) > D(fi w ,0). From Prop g with 
probability 1 — e~ n l 2 



p (IZ B v <!) = (!- w i) 1 + Ml - w i) 1 

<(l + Z)(l-w min ) z <(Z + l)e"^ 



<^(/i T ,0) < 



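so $P\left(\sum_j B_{ij} \ge 2\right) \ge 1 - (l + 1)e^{-l w_{\min}}$, that is, $P(P_i \text{ is represented twice}) \ge 1 - (l + 1)e^{-l w_{\min}}$, so

$P(P_i \text{ is represented twice}, \forall i = 1, \dots, k) \ge \left(1 - (l + 1)e^{-l w_{\min}}\right)^k \ge 1 - k(l + 1)e^{-l w_{\min}}$.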
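To get a feeling for how many initial templates $l$ the bound $1 - k(l + 1)e^{-l w_{\min}}$ requires, the following sketch evaluates it and searches for the smallest $l$ reaching a target probability. The helper names, the search cap and the example values of $k$ and $w_{\min}$ are assumptions made for illustration only.

```python
from math import exp


def coverage_bound(k, l, w_min):
    """Lower bound 1 - k (l+1) exp(-l w_min) on P(every template is represented twice)."""
    return 1 - k * (l + 1) * exp(-l * w_min)


def smallest_l(k, w_min, target, l_max=100000):
    """Smallest l <= l_max whose bound reaches the target, or None if there is none."""
    for l in range(1, l_max + 1):
        if coverage_bound(k, l, w_min) >= target:
            return l
    return None


if __name__ == "__main__":
    k, w_min = 3, 0.2  # assumed number of clusters and smallest mixture weight
    print(smallest_l(k, w_min, 0.99))
```

Because of the exponential factor, the required $l$ grows only like $\frac{1}{w_{\min}}\ln k$ up to logarithmic terms.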



1 . Thus 

(I 



< nq + W 3ng ( In 



2|S,|e , 2 , ln2+ ^n 



< 



/3<7 , 24gZTn2 1 N 

< nq + +n\ — InSel -\ — ( h -) < 

11 n m n 2 



2. From the Chernoff bound we have $P\left(\sum_{j=1}^{l} B_{ij} > \tfrac{3}{2} l w_i\right) < e^{-l w_i/12}$, which implies the results.

< ng + m c 2 (l - 2<?) 2 ( 

from conditions C3 and C4.

3. As there exist $T_i^{(0)}, T_j^{(0)}$ representing the same cluster, $2nq_0(1 - q_0) \le D(T_i^{(0)}, T_j^{(0)}) \le 2nq(1 - q)(1 + \epsilon_0)$ (from Prop. [3] a). Also from Prop. [3]



ik + ik ) = nq+ k nc{1 - 2q) 

a) 



$2nq_0(1 - q_0) = (2nq(1 - q) + nc_{ij}(1 - 2q)^2)(1 \pm \epsilon_0) \ge 2nq(1 - q)(1 - \epsilon_0)$

so both parts of the inequality are proved. □
Proof of Prop [7] . We have 



For the second term, from Prop. [3] we have, for $x \in S_j$,

$D(x, P_i) \le (nq + nc_{ij}(1 - 2q))(1 + \epsilon_0)$

where since $\epsilon_0 < 0.5$ we have

Pi,D{x,Pi) < e- n ^ B ^- 2 "Hnq + ncy(l - 2q)){\ + e ) < 

< e -n Cij B(l-2q)/2 < e -ncB(l-2g)/2 



^(x) _ gg^^^ 

p (}) (x) ^(x,T(0)) (i _^ )n _ D(x , T (0) ) 



D(x,T^)-D(x,T^)^ 



(OK 



SO 



E^Exes^M^P*) i 



E x ^ 1} ( x ) 



with a 



1-90 



> 1 



. But from Prop. [5] 



D(x,T^)-D(x,Tn > (2n«(l - q) + n Cij (l - 2q) 2 )(l - e )- 



- 2nq(l - q)(l + e ) = 4nq(l - q)e + 
+ n Cij (1 - 2g) 2 (l - e ) > ncy(l - 2g) 2 /2 

since eo < 1/2. We also have 

^ 2 1/4 i i 



1 -ncB(l-2g)/2 < * (1 _ 2 ) 

48 

(2) 



< — e 

Wt 



$a = \frac{(1 - q_0)^2}{q_0(1 - q_0)} \ge \frac{(1 - q_0)^2}{q(1 - q)(1 + \epsilon_0)} \ge \frac{1}{4q(1 + 1/2)} = \frac{1}{6q}$



using condition C2. Putting together (1) and (2) we get the result. □

Proof of Prop. [9] a) From Propositions [3] and [6] we have that $|S_i| \ge m w_i/2$ and at most $3lw_i/2$ initial centers are from $S_i$.

Let i' be such that T-^ G S$ and x G S^. For any j such 
that T^ 0) 5i we have from Prop [7] ^ ( / 1) (x)/^ 1) (x) > 



Proof of Prop. [5] Without loss of generality we can e n CiJ B > e ncB(i-2g) Thenp^(x) < e - ncB< ^- cl( i) an( j thus 



assume P; = 0. 



^(T^.Pi) 



E x 4 1} (x) 



^ 1J (x) > 1 - l e -ncB(l-2q) _ g ut then 



< 



Efc=iEx€SiPi' i; ( x ) x * , ELiEx^s^Mxfc 



E 



< 



< 



(1) _ ExesEfc^^'es.Pfe } ( x ) 
m 



> 



Ex4 1} (x) 



Ex4 1} (x) 



ELiEx e s,^ 1J ( x ) x fc ^ E j¥i Ex € 5,^ ; ( x )^( x ,o) 



^ |^|(l -^6-^(1-2,)) ^ «^ (i _ le . ncB(1 _ 2q)) 



ExP!' 1} (x) 

From Prop bl for any x g Sj, j 7^ z we have p^(x) < 

p -naj J B(l-2c?X < - -ncB(l-2q) JUp n 



"But |fe, T^ 1} G Si I < 3Z^/2 so there is a fe, T^ i; G Si such 
that 

m ^i(l-/e- ncB )/2 l _ l e -ncB(l-2q) j 

^ ^ = 3Z -4l = " T 

using condition C2, thus is not empty. 

.(1) rpC 4 ^ 
J v ' 3 



.(1) 



> m^ T - me" nCjB(1 " 2Q) > mw T /2 + 1 



from condition C2 and C4.

b) Pick any $T_i^{(1)} \in G_i$ and $T_j^{(1)}, T_{j'}^{(1)} \in G_j$ for $i \ne j$. Then from Proposition [8] we have

$D(T_j^{(1)}, T_{j'}^{(1)}) \le 2nq + \frac{2}{16}nc(1 - 2q) = 2nq + \frac{1}{8}nc(1 - 2q)$

while using Proposition [8] and the triangle inequality we get

$D(T_i^{(1)}, T_j^{(1)}) \ge D(P_i, P_j) - 2nq - \frac{2}{16}nc(1 - 2q) \ge nc - 2nq - \frac{1}{8}nc(1 - 2q) > 2nq + \frac{1}{8}nc(1 - 2q)$



from condition CO, so we can take r = \nc. 

c) There are $k$ true clusters, exactly as many as selected templates. If two selected templates were from the same cluster, there would be a cluster with no selected template. But two templates from the same cluster are at distance at most $r$, while a template from the unselected cluster is at distance more than $r$ from every selected template, so we get a contradiction. □

Proof of Prop. [10] Using the triangle inequality, Prop. [3] and Prop. [8] we have

$D(x, T_i^{(1)}) \le D(x, P_i) + D(T_i^{(1)}, P_i) \le$



(2) 

From Proposition 10 we have for $x \in S_i$, $p_j^{(2)}(x) \le p_i^{(2)}(x)e^{-ncB/4} \le e^{-ncB/4}$, so

$p_i^{(2)}(x) = 1 - \sum_{j \ne i} p_j^{(2)}(x) \ge 1 - ke^{-ncB/4}$

So the first term is bounded as:

E xe s i pf ) W^(x,Pe) < £ xeSi (l " ke-^)D( X ,P t ) 



Ex€S 4 Pi 2) ( X ) 
„( 2 ) 



|5i|(l-/ce-™ cB / 4 ) 



Ex 6 5>i W - (1 - fce-" CB/4 ))^(x,P i ) 



< D(mean(S' i ),P 



Exes.pfV) 

Exes, fce- cB / 4 D(x,P,) 



< D(mean(5 i ),P i ) 



< nq(l + e ) + nq + ^ nc ( 1 ~ 2 ^) 



and 



-fce-" cS / 4 ) 
$+ \frac{|S_i| ke^{-ncB/4}\, nq(1 + \epsilon_0)}{|S_i|(1 - ke^{-ncB/4})} \le D(\mathrm{mean}(S_i), P_i) + 2ke^{-ncB/4} nq$

when $\epsilon_0 \le 1 - 2ke^{-ncB/4}$.

The second term is bounded as: 



D(x,Tf ) )>i?(x,P j 



■2?(Tf .P,-) 



> 



E^Ex eSj pi 2) W^(^Pi) 



-ncS/4 



> n(q + c^(l - 2g))(l - e ) - nq - — nc(l - 2q), 



2nge 



< 



16 



so 



< 



-ncB/8 



(2)/ \ D(x,T^ 1) ) /1 \n-ZXxT 

^ ; (x) _ g (1 -^o) n ^ x,i 



p( 4 h 



. 3 _ 
D/IN < — noe 



\Si\(l-ke- ncB / 4 ) 

ncB/8 



D(x,Tf')-B(x,Tr^ hen e -ncB/8 < q and fce -ncB/4 < 1/3 

From the inequality 

ke -ncB/8 < j < 



n(i) 



where a 

P t (2) (x) 
pf(x) 



1-go 

<70 



> 1, and therefore 



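we get the result. □

$\ge \exp\left(\left[n(q + c_{ij}(1 - 2q))(1 - \epsilon_0) - nq(1 + \epsilon_0) - 2nq - \tfrac{1}{8}nc(1 - 2q)\right]\ln a\right)$

$= \exp\left(n\left[c_{ij}(1 - 2q)(1 - \epsilon_0) - 2q(1 + \epsilon_0) - \tfrac{1}{8}c(1 - 2q)\right]\ln a\right) \ge \exp\left(nc_{ij}\tfrac{1}{4}(1 - 2q)\ln a\right)$

using condition C1. □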

Proof of Theorem [2] First we compute the probability that the theorem holds. Proposition [3] holds with probability at least $1 - 2m^2e^{-2nq(1 - q)\epsilon_0^2/3} - me^{-nq\epsilon_0^2/3} - ke^{-mw_{\min}/8}$. Proposition [6] holds with probability at least $1 - k(l + 1)e^{-lw_{\min}} - ke^{-lw_{\min}/12}$. Proposition [4] holds with probability at least $1 - e^{-n/2}$ for each of the $k$ clusters. All other propositions hold if these three propositions hold. Thus with probability at least

$1 - 2m^2e^{-2nq(1 - q)\epsilon_0^2/3} - me^{-nq\epsilon_0^2/3} - ke^{-mw_{\min}/8} - k(l + 1)e^{-lw_{\min}} - ke^{-lw_{\min}/12} - ke^{-n/2}$

all propositions hold for all clusters.
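The overall probability can be evaluated for concrete problem sizes. The sketch below sums the six failure terms as they are read from the proof above; the function name and the example parameter values are assumptions chosen only to illustrate how the terms behave, not settings from the paper.

```python
from math import exp


def success_probability_bound(n, m, k, l, q, eps0, w_min):
    """Lower bound on the probability that all propositions hold for all clusters."""
    failure = (
        2 * m**2 * exp(-2 * n * q * (1 - q) * eps0**2 / 3)  # Prop. 3 a) and b)
        + m * exp(-n * q * eps0**2 / 3)                      # Prop. 3 c)
        + k * exp(-m * w_min / 8)                            # Prop. 3 d)
        + k * (l + 1) * exp(-l * w_min)                      # each template drawn twice
        + k * exp(-l * w_min / 12)                           # not too many initial centers
        + k * exp(-n / 2)                                    # subset bound, one per cluster
    )
    return 1 - failure


if __name__ == "__main__":
    # Assumed sizes: n features, m examples, k clusters, l initial templates.
    print(success_probability_bound(n=100000, m=1000, k=3, l=200,
                                    q=0.1, eps0=0.4, w_min=0.3))
```

For these assumed values the term $ke^{-lw_{\min}/12}$ dominates, which suggests that the number of initial templates $l$, rather than the dimension $n$, limits the guarantee once $n$ is large.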

Now we prove the distance inequality. Similar to the 
proof of Proposition [7] we have 

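$D(T_i^{(2)}, P_i) = D\left(\frac{\sum_x p_i^{(2)}(x)\, x}{\sum_x p_i^{(2)}(x)}, P_i\right) \le \frac{\sum_x p_i^{(2)}(x) D(x, P_i)}{\sum_x p_i^{(2)}(x)}$

$\le \frac{\sum_{x \in S_i} p_i^{(2)}(x) D(x, P_i)}{\sum_{x \in S_i} p_i^{(2)}(x)} + \frac{\sum_{j \ne i}\sum_{x \in S_j} p_i^{(2)}(x) D(x, P_i)}{\sum_{x \in S_i} p_i^{(2)}(x)}$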



